Data Visualization of COVID-19 DATA

Introduction

Given the high relevance of COVID-19 in 2020 and the impact it has on the vast majority of individuals around the globe, staying informed regarding the current status and spread of COVID around the world has become a daily task for many people. While the severity of the situation necessitates daily monitoring and continuous tracking of the virus, generating enormous amounts of data each day, this also makes it easy for the general population to be overwhelmed by the overabundance of COVID-related data made available from the scientific community. When dealing with such the issue of big datasets, data visualization can serve the critical function of making large data more accessible and comprehensible to the public and the decision makers at large. In our project, we will explore a few of these different graphical concepts that may facilitate a better understanding of the large amounts of information and see how such concepts can turn simple numbers into a relevant narrative.

In this project, we use datasets from two different sources.

The first dataset, containing racial and ethnic data on COVID-19 cases in the United States, comes from a collaboration between COVID Tracking Project and Boston University Center for Antiracist Research. This racial COVID data include the number of cases and deaths for different races and ethnicities that are reported across all 50 states. For this first dataset, marginal histogram is used to visualize whether COVID-19 affects different races disproportionately across the US.

The second dataset comes from Our World in Data, which gathers COVID-19 data from various government agencies across the globe. In addition to reporting more basic measures such as the number of cases, deaths, hospitalization, testing, etc., this dataset also includes social measures such as the government stringency index and the human development index. The government stringency index is calculated by considering various restrictions the government imposes on the citizens such as wearing a mask in public, workplace closure, international travel ban, etc. The human development index is calculated by considering various measures of achievement in key dimensions of human development in the country such as life expectancy and standard of living.

With this dataset, we’ll first demonstrate the usage of bubble plots. Bubble plot allows plotting of more than 2 variables with the weight or size of the bubble serving as the 3rd axis, and will be used to visualize the relationship between these social measures and the number of COVID-19 deaths in a country.

Using the same dataset, we’ll also demonstrate how to create an interactive plot. Interactive plot allows users to manipulate the ways in which the data can be viewed. Some of the ways in which one can interact with the plot may be as simple as being able to zoom in and out of different sections of the plot, include or exclude certain variables in the plot, or click on a particular bar or point to see the actual value. We will walk through the implementation of these mentioned interactive functions into our plots. This interactive plot will be a bar graph showing the number of COVID-19 cases, deaths, hospitalizations, and testing numbers in 5 different Eastern European countries.

Software/Tools

In this tutorial, we are using R, STATA, MATLAB and Python to show how these graphical concepts may be implemented.

Software Relevant Libraries
R tidyverse (manipulation of data), ggplot2 (plotting), cowplot (plotting), ggplotly (adding interactive functionality)
STATA the built-in command scatter and twoway histogram used to create the three subplots in marginal plots, and user-written command grc1leg is used to combine these subplots into one figure with a command legend. Scatter is also used to create the bubble plot. In the extended example of multi-categories histograms, user-written commands tabcount (calculate frequency over categories), bihist and byhist (plotting histograms) are also used.
Python numpy (manipulation of data) , pandas (manipulation of data), matplotlib (plotting), seaborn (plotting), plotly (adding interactive functionality)
MATLAB Clickable Legend Toolbox (adding interactive functionality) and subaxis (allow margin adjustment on subplot) from MATLAB File Exchange, plotly-graphing-library-for-matlab (adding interactive functionality) from plotly (bubblechart requires ). ** NOTE: bubblechart function used in plotting bubblechart requires MATLAB_R2020b.

Marginal Histogram

A scatter plot is one of the most frequently used visualizations. While it has the benefits of retaining the exact data values and sample size, it can be difficult to see the patterns behind those massive number of data points, such as how distributions vary along the axes and across different categories. The combination of scatter plots and marginal plots (usually histograms or box plots) allows us to exploit the benefits of both plot types and observe the connection between individual data points and their aggregation.
In this example, we use a scatter plot to display the # of cases and deaths across the four racial groups, and introduce marginal histograms of both axes to analyze the distribution.

  • Variables
    • ID: State & Date
    • Race
    • #Cases
    • #Deaths
  • This figure presents only the Nov 1, 2020 data.
  • The Race Data Entry file specifies 9 different racial and ethnicity categories: White, Black, LatinX (Hispanic or Latino), Asian, AIAN (American Indian or Alaska Native), NHPI (Native Hawaiian and Pacific Islander), Multiracial, Other, and Unknown. Since plotting of all those groups would result in a scatterplot that was overcrowded and confusing to interpret, we narrowed down the dataset to only look at 4 different racial groups: White, Black, Asian, and LantinX.

R

# setup
library(tidyverse)
Reading in and pre-processing the data
filename_racial = "./Data/Race Data Entry - CRDT.csv"
racial_data = read_delim(
  filename_racial, delim = ",",
  col_types = cols(
   .default = col_integer(), 
   Date = col_character(),
   State = col_character()
  )
) %>% 
  pivot_longer(
    cols = starts_with(c("Cases", "Deaths")), 
    names_pattern = "(Cases|Deaths)_(.*)",
    names_to = c(".value", "Race")
  ) %>%
  filter(!str_detect(Race, "Total|Ethnicity"))

date_picked = "20201101"
race_picked = c("White", "Black", "LatinX", "Asian")
plot1_data = racial_data %>%
  filter(
    Date == date_picked, 
    str_detect(Race, paste0(race_picked, collapse = "|"))
  ) %>%
  mutate(
    Race = factor(Race, levels = race_picked, ordered = TRUE)
  )
Format of plot1_data
Date State Race Cases Deaths
20201101 AK White 5063 31
20201101 AK Black 574 3
20201101 AK LatinX NA NA
20201101 AK Asian 680 8
20201101 AL White 59056 1555
Number of NA values in plot1_data
Date State Race Cases Deaths
0/224 0/224 0/224 51/224 54/224
Generating the main plot
plot1_title = sprintf(
  "%s (%s)",
  "Total confirmed COVID-19 deaths vs. cases, U.S. States",
  format(as.Date(date_picked, "%Y%m%d"), "%m/%d/%y")
)
palette_picked = "Set1" # pick a color palette
# use `RColorBrewer::display.brewer.all()` to check out available palettes

pic1_main = plot1_data %>%
  ggplot(aes(x = Cases, y = Deaths, color = Race)) +
  theme_bw() +
  geom_point(size = 2, alpha = .7) +
  scale_color_brewer(palette = palette_picked) +
  ggtitle(plot1_title) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 12),
    axis.title = element_text(size = 9)
  )
print(pic1_main)

Creating marginal plots

ggExtra::ggMarginal()

pic1 = pic1_main + 
  theme(legend.position = "left") # move legend to the left side
  # because ggMarginal() adds margin plots on the right side

# marginal histogram
ggExtra::ggMarginal(
  pic1, type="histogram", 
  groupColour = TRUE, groupFill = TRUE
)

We can create marginal box plots by setting type = "boxplot". The parameter size specifies the relative size of the main plot to the marginal ones. A size of 5 means that the main plot is 5x wider and 5x taller than the marginal plots.

ggExtra::ggMarginal(
  pic1, type="boxplot", size = 5,
  groupColour = TRUE, groupFill = TRUE
)

Although ggMarginal() provides a easy solution to marginal plots, it is not friendly to customization. For example, we cannot set the two marginal plots to be of different types or of different sizes. Neither can we specify the exact size for marginal plots in units such as inch.
Check out the cowplot section for a more flexible solution to creating marginal plots in R.

Cowplot

Using package cowplot to create marginal plots is less succinct, but more flexible. Unlike ggExtra::ggMarginal, cowplot allows:

  • putting the marginal plots on the left/bottom side of the main image;
  • creating a marginal histogram for one axis, and a box plot/density plot for another.
pic1_xhist = cowplot::axis_canvas(pic1_main, axis = "x") +
  geom_histogram(
    data = plot1_data %>% filter(!is.na(Cases) & !is.na(Deaths)), 
    aes(x = Cases, fill = Race, color = Race),
    bins = 30, alpha = .6
  ) +
  scale_fill_brewer(palette = palette_picked) +
  scale_color_brewer(palette = palette_picked)

pic1_yhist = cowplot::axis_canvas(pic1_main, axis = "y", coord_flip = TRUE) +
  geom_histogram(
    data = plot1_data %>% filter(!is.na(Cases) & !is.na(Deaths)), 
    aes(x = Deaths, fill = Race, color = Race),
    bins = 30, alpha = .6
  ) +
  scale_fill_brewer(palette = palette_picked) +
  scale_color_brewer(palette = palette_picked) +
  coord_flip()

pic1_xhist
pic1_yhist


{pic1_main + 
  theme(legend.position = "left") # move legend to the left
  # to make room for margin plots
} %>%
  cowplot::insert_xaxis_grob(
    pic1_xhist, 
    height = grid::unit(.8, "in"), 
    position = "top"
  ) %>%
  cowplot::insert_yaxis_grob(
    pic1_yhist, 
    width = grid::unit(.8, "in"), 
    position = "right"
  ) %>%
  cowplot::ggdraw()

Marginal plots on the bottom/left side

We can move the marginal plots from the top/right side of the original figure to the bottom/left side.
To do that, we may want to flip the two previously created marginal histograms using scale_y_reverse(). To make room for the marginal plots, we can move the x-axis and y-axis to the top/right side of the original figure by setting the position parameter in scale_*_continuous().

{pic1_main +
  # move the axis labels to the top/right to make room for margin plots
  scale_x_continuous(position = "top") +
  scale_y_continuous(position = "right")
} %>%
  cowplot::insert_xaxis_grob(
    pic1_xhist + scale_y_reverse(), 
    height = grid::unit(.8, "in"), 
    position = "bottom"
  ) %>%
  cowplot::insert_yaxis_grob(
    pic1_yhist + scale_y_reverse(), 
    width = grid::unit(.8, "in"), 
    position = "left"
  ) %>%
  cowplot::ggdraw()

Marginal plots of different types

Using cowplot, we can easily specify different plot types for the x,y-axis marginal plots. Below is the code example to replace the y-axis marginal histogram with a box plot.

pic1_ybox = cowplot::axis_canvas(pic1_main, axis = "y") +
  geom_boxplot(
    data = plot1_data %>% filter(!is.na(Cases) & !is.na(Deaths)),
    aes(x = 0, y = Deaths, fill = Race, color = Race), 
    alpha = .6
  ) +
  scale_fill_brewer(palette = palette_picked) +
  scale_color_brewer(palette = palette_picked)

{pic1_main + 
  theme(legend.position = "left")
} %>%
  cowplot::insert_xaxis_grob(
    pic1_xhist, 
    height = grid::unit(.8, "in"), 
    position = "top"
  ) %>%
  cowplot::insert_yaxis_grob(
    pic1_ybox, 
    width = grid::unit(.8, "in"), 
    position = "right"
  ) %>%
  cowplot::ggdraw()

We can even call cowplot::insert_yaxis_grob() twice to add multiple margin plots for the y axis (although probably unnecessary).

STATA

In the STATA example, we will not be using stacked histograms grouped by variable race like we do in the other examples. That is because:

  1. generating stacked histograms can be complicated using STATA (we will include a discussion on how to generate histograms with multiple groups in STATA right after the tutorial on marginal plots);
  2. stacked histograms generated using command graph bar cannot line up with the scatter plot, even if we explicitly set the x,y-axis range to be the same, while simple histograms generated using twoway histogram can line up with the scatter plot.
Reading in and pre-processing the data
cd "E:\git\Stats506_midterm_project\STATA\"
import delimited "..\Data\Race Data Entry - CRDT.csv", ///
  stringcols(1) numericcols(3/28) clear

keep if date == "20201101"
keep state *white *black *latinx *asian

reshape long cases_ deaths_, i(state) j(race) string
rename cases_ cases
rename deaths_ deaths

// turn race into a ordered factor
gen race_int = 1
replace race_int = 2 if race == "black"
replace race_int = 3 if race == "latinx"
replace race_int = 4 if race == "asian"

label define race_label 1 "White" 2 "Black" 3 "LatinX" 4 "Asian"
label values race_int race_label 
drop race
rename race_int race

// Drop observations with cases or deaths values being NA
// Keeping these NA values will not affect scatter plot -- either axis being 
// NA, the scatter won't show. But it does affect histograms since they will 
// be based on a single variable (cases or deaths)
drop if cases == . | deaths == .

// categorize deaths (y-axis variable) into four groups by race
separate deaths, by(race) veryshortlabel

save race_plot_data, replace
Creating marginal plots

We will be Using the user written command grc1leg by Vince Wiggins to create the marginal plots.

grc1leg. Combine graphs into one graph with a common legend. / Program by Vince Wiggins, StataCorp . / Statalist distribution, 16 June 2003. / Distribution-Date: 02jun2010

Use the following line to install grc1leg:

net install grc1leg, from(http://www.stata.com/users/vwiggins/)

Note:

  1. Run the code chunk from ‘local alpha’ to ‘twoway histogram cases’ at once to make sure the local variables can be used by all three subplots.
  2. The optacity settings (e.g. mcolor(red%30)) are only available in STATA15 or above. For STATA14 or below, try using hollow circles instead of solid ones by adding the option msymbol(Oh).
  • Creating top/right margin plots
use race_plot_data, clear

// set opacity
local alpha %30 
// set x,y-axis limits for both the main plot and the marginal plots
// otherwise the axis scales might differ across the three plots
local xscale_opt xscale(range(0 415000)) 
local yscale_opt yscale(range(0 10500))

// create the main plot
// set color for the four racial groups
local scatter_opt mcolor(red`alpha' blue`alpha' green`alpha' purple`alpha')
scatter deaths? cases, `scatter_opt' ///
  `xscale_opt' `yscale_opt' ///
  legend(subtitle("Race") position(9) cols(1) region(style(none))) ///
  ytitle("Deaths") xtitle("Cases") xlabel(, grid) ///
  saving(yx_tr, replace)

// create the two marginal histograms
local hist_opt color(navy`alpha') bin(30)
twoway histogram deaths, `hist_opt' fraction ///
  `yscale_opt' ysca(alt) horiz fxsize(25) ///
  ytitle("") xlabel(, nogrid) ///
  saving(hy_tr, replace)
twoway histogram cases, `hist_opt' fraction ///
  `xscale_opt' xsca(alt) fysize(25) ///
  xtitle("") xlabel(, grid) ylabel(, nogrid) ///
  saving(hx_tr, replace)

// group the three plots together using 'grc1leg'
// 'colfirst' asks that the three plots will be displayed down column first
// 'legendfrom(yx_tr.gph)' specify using the main plot's legend
// 'position(9)' means putting the legend at 9 o'clock position
local graph_title = "Total confirmed COVID-19 deaths vs. cases, " + ///
                    "U.S. States (11/01/20)"
grc1leg hx_tr.gph yx_tr.gph hy_tr.gph, ///
  colfirst hole(3) imargin(0 0 0 0) ///
  title("`graph_title'", size(medium)) ///
  legendfrom(yx_tr.gph) position(9)
  
// save the final output
graph export stata-marginplot-tr.png, replace

// remove the intermediate files from disk
erase hx_tr.gph
erase yx_tr.gph
erase hy_tr.gph

  • Creating bottom/left margin plots
// create the main plot
local alpha %30
local xscale_opt xscale(range(0 415000)) 
local yscale_opt yscale(range(0 10500))

local scatter_opt mcolor(red`alpha' blue`alpha' green`alpha' purple`alpha')
scatter deaths? cases, `scatter_opt' ///
  `xscale_opt' `yscale_opt' ///
  legend(subtitle("Race") position(3) cols(1) region(style(none))) ///
  ytitle("Deaths") xtitle("Cases") ///
  xlabel(, grid) ysca(alt) xsca(alt) /// 
  saving(yx_bl, replace)

// create the two marginal histograms
local hist_opt color(navy%30) bin(30)
twoway histogram deaths, `hist_opt' fraction ///
  `yscale_opt' xsca(alt reverse) horiz fxsize(25) ///
  ytitle("") xlabel(, nogrid) ///
  saving(hy_bl, replace)
twoway histogram cases, `hist_opt' fraction ///
  `xscale_opt' ysca(alt reverse) fysize(25) ///
  xtitle("") xlabel(, grid) ylabel(, nogrid) ///
  saving(hx_bl, replace)

// group the three plots together using 'grc1leg'
// unlike the top-right marginal plot, we do not specify'colfirst' this time
// so that the three plots will be displayed along row first
// legendfrom(yx_bl.gph) specify using the main plot's legend
// position(3) means putting the legend at 3 o'clock position
local graph_title = "Total confirmed COVID-19 deaths vs. cases, " + ///
                    "U.S. States (11/01/20)"
grc1leg hy_bl.gph yx_bl.gph hx_bl.gph, ///
  hole(3) imargin(0 0 0 0) ///
  title("`graph_title'", size(medium)) ///
  legendfrom(yx_bl.gph) position(3)

// save the final output
graph export stata-marginplot-bl.png, replace

// remove the intermediate files from disk
erase hy_bl.gph
erase yx_bl.gph
erase hx_bl.gph


Extension: Generating histograms with multiple categories in STATA

As mentioned above, stacked histograms can be difficult to generate in STATA, and they are not suitable as marginal plots since they cannot properly line up with the main scatter plot.
However, as an extended example, we will walk through the code to create a stacked histogram of variable cases grouped by race, using the racial data race_plot_data.dta in the marginal plot examples and the graph bar command.

Data and summary statistics
cd "E:\git\Stats506_midterm_project\STATA\" 
use race_plot_data, clear
// check out the data format
list state race cases in 1/5
// check out the summary statistics for variable 'cases'
summarize cases
 . cd "E:\git\Stats506_midterm_project\STATA\" 
E:\git\Stats506_midterm_project\STATA

. use race_plot_data, clear

. // check out the data format
. list state race cases in 1/5

     +-----------------------+
     | state    race   cases |
     |-----------------------|
  1. |    AK   Asian     680 |
  2. |    AK   Black     574 |
  3. |    AK   White    5063 |
  4. |    AL   Asian     628 |
  5. |    AL   Black   38319 |
     +-----------------------+

. // check out the summary statistics for variable 'cases'
. summarize cases

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       cases |        165    32516.75    51177.92         74     401503
Creating stacked histogram using graph bar

Since the graph bar command is not designed for histograms, we need to manually turn the continuous variable cases into intervals and store as a categorical variable cut_cases. After that, we can calculate the frequency of each intervals and produce the histogram.

From the output of summarize cases, we know that values of cases falls in range [74, 401503]. Therefore, we can cut it into 30 intervals from 0 to 402000.

Note: since R’s library Statamarkdown runs each STATA code chunk as an stand-alone .do file, the following code chunks will contain some repetitive code.

cd "E:\git\Stats506_midterm_project\STATA\" 
use race_plot_data, clear
// use 'egen ... cut' with option 'at' to turn 'cases' into intervals
egen cut_cases = cut(cases), at(0(13400)402000)
list cases cut_cases in 1/10
 . cd "E:\git\Stats506_midterm_project\STATA\" 
E:\git\Stats506_midterm_project\STATA

. use race_plot_data, clear

. // use 'egen ... cut' with option 'at' to turn 'cases' into intervals
. egen cut_cases = cut(cases), at(0(13400)402000)

. list cases cut_cases in 1/10

     +------------------+
     | cases   cut_ca~s |
     |------------------|
  1. |   680          0 |
  2. |   574          0 |
  3. |  5063          0 |
  4. |   628          0 |
  5. | 38319      26800 |
     |------------------|
  6. | 59056      53600 |
  7. |  1130          0 |
  8. | 22837      13400 |
  9. | 67721      67000 |
 10. |  2745          0 |
     +------------------+

Interpretation of cut_cases: value 0 represents the range [0, 13400), value 13400 represents the range [13400, 26800), etc.

We then use tabcount to calculate the number of observations for each level of cut_cases and race. Install tabcount using ssc install tabcount if you have not already done that.

cd "E:\git\Stats506_midterm_project\STATA\" 
use race_plot_data, clear
egen cut_cases = cut(cases), at(0(13400)402000)

quietly tabcount cut_cases race, v1(0(13400)402000) v2(1/4) replace
list in 1/20
 . cd "E:\git\Stats506_midterm_project\STATA\" 
E:\git\Stats506_midterm_project\STATA

. use race_plot_data, clear

. egen cut_cases = cut(cases), at(0(13400)402000)

. 
. quietly tabcount cut_cases race, v1(0(13400)402000) v2(1/4) replace

. list in 1/20

     +--------------------------+
     | _freq   cut_ca~s    race |
     |--------------------------|
  1. |    11          0   White |
  2. |     5      13400   White |
  3. |     4      26800   White |
  4. |     4      40200   White |
  5. |     7      53600   White |
     |--------------------------|
  6. |     4      67000   White |
  7. |     2      80400   White |
  8. |     3      93800   White |
  9. |     2     107200   White |
 10. |     3     120600   White |
     |--------------------------|
 11. |     1     134000   White |
 12. |     1     147400   White |
 13. |     0     160800   White |
 14. |     1     174200   White |
 15. |     0     187600   White |
     |--------------------------|
 16. |     0     201000   White |
 17. |     1     214400   White |
 18. |     0     227800   White |
 19. |     0     241200   White |
 20. |     0     254600   White |
     +--------------------------+

It is important to use tabcount here rather than a simple bysort command (see the following code chunk). tabcount cut_cases race, v1(0(13400)402000) v2(1/4) will preserve every intervals of cases from [0, 13400) to [388600, 402000), even if there is no observation in that range. To create a histogram, we will need the zero-frequency intervals.
On the other hand, bysort only keeps the levels that actually appear in the dataset (with non-zero frequencies), which means the intervals of cases without data points will not be preserved.

cd "E:\git\Stats506_private\group_proj\STATA\" 
use race_plot_data, clear
egen cut_cases = cut(cases), at(0(13400)402000)

// DO NOT USE THIS METHOD for the histogram!
bysort cut_cases race: generate freq = _N
keep cut_cases race freq
duplicates drop
list
 . cd "E:\git\Stats506_private\group_proj\STATA\" 
E:\git\Stats506_private\group_proj\STATA

. use race_plot_data, clear

. egen cut_cases = cut(cases), at(0(13400)402000)

. 
. // DO NOT USE THIS METHOD for the histogram!
. bysort cut_cases race: generate freq = _N

. keep cut_cases race freq

. duplicates drop

Duplicates in terms of all variables

(134 observations deleted)

. list

     +--------------------------+
     |   race   cut_ca~s   freq |
     |--------------------------|
  1. |  White          0     11 |
  2. |  Black          0     25 |
  3. | LatinX          0      6 |
  4. |  Asian          0     45 |
  5. |  White      13400      5 |
     |--------------------------|
  6. |  Black      13400      8 |
  7. | LatinX      13400      4 |
  8. |  White      26800      4 |
  9. |  Black      26800      5 |
 10. | LatinX      26800      7 |
     |--------------------------|
 11. |  Asian      26800      1 |
 12. |  White      40200      4 |
 13. |  Black      40200      6 |
 14. | LatinX      40200      2 |
 15. |  White      53600      7 |
     |--------------------------|
 16. |  Black      53600      1 |
 17. |  White      67000      4 |
 18. | LatinX      67000      1 |
 19. |  White      80400      2 |
 20. |  White      93800      3 |
     |--------------------------|
 21. |  Black      93800      1 |
 22. | LatinX      93800      1 |
 23. |  White     107200      2 |
 24. |  Black     107200      1 |
 25. |  White     120600      3 |
     |--------------------------|
 26. |  White     134000      1 |
 27. |  White     147400      1 |
 28. |  White     174200      1 |
 29. |  White     214400      1 |
 30. | LatinX     227800      1 |
     |--------------------------|
 31. | LatinX     388600      1 |
     +--------------------------+

With the frequency table generated using tabcount, we can now use command graph bar (asis) _freq, over(race) over(cut_cases) asyvar stack to create the stacked histograms. Option asyvar asks that the frequencies be stacked over race – the first over() group.
Since the tabcount command with option replace will replace the current dataset with the frequency table, we can use the preserve and restore commands to create a snapshot of the current data and switch back to it later.

cd "E:\git\Stats506_private\group_proj\STATA\" 
use race_plot_data, clear
egen cut_cases = cut(cases), at(0(13400)402000)

preserve
quietly tabcount cut_cases race, v1(0(13400)402000) v2(1/4) replace
graph bar (asis) _freq, over(race, sort(race) descending) ///
  over(cut_cases, gap(0) label(nolabels)) asyvar stack
graph export stata-stacked_histogram-graph_bar.png, replace
restore

We can further specify the colors for each racial group:

cd "E:\git\Stats506_private\group_proj\STATA\" 
use race_plot_data, clear
egen cut_cases = cut(cases), at(0(13400)402000)

preserve
quietly tabcount cut_cases race, v1(0(13400)402000) v2(1/4) replace
local alpha %40
graph bar (asis) _freq, over(race, sort(race) descending) ///
  over(cut_cases, gap(0) label(nolabels)) asyvar stack ///
  bar(1, color(red`alpha')) bar(2, color(blue`alpha')) ///
  bar(3, color(green`alpha')) bar(4, color(purple`alpha'))
graph export stata-stacked_histogram-graph_bar-colored.png, replace
restore

Much simpler solutions: bihist & byhist

Apart from the stacked histogram, there are two much simpler methods to add another dimension of information in histograms: (i) using command bihist to create histograms with exactly two categories; and (ii) using command byhist to create interlaced histograms with two or more categories.

These user written commands are designed for histograms, which saves us the trouble of computing frequency over intervals. Therefore, the code is much more succinct and easy to remember.

Install bihist and byhist using ssc install bihist and ssc install byhist if you have not already done that.

use race_plot_data, clear
bihist cases if race == 1 | race == 2, by(race)

use race_plot_data, clear
byhist cases, by(race)

MATLAB

Reading in and pre-processing the data

Prior to any plotting, we’ll organize the csv file so it only contains the relevant data necessary for plotting.

%% Date Prep 
% Read in data
fileName = 'Race Data Entry - CRDT.csv';
raceData = readtable(fileName);

%Select only 11/01/20 data
newestDate = 20201101;
raceData = raceData(raceData.Date == newestDate,:);

% Convert all case/death # to double
colnames = raceData.Properties.VariableNames;
for i = 3:length(colnames)
    if iscell(raceData.(genvarname(colnames{i})))
        raceData.(genvarname(colnames{i})) = ...
        str2double(raceData.(genvarname(colnames{i})));
    end
end

Let’s narrow down the dataset to only look at 4 different racial groups: White, Black, Latinx, and Asian since including all available racial categories would provide too much information on a single figure.

%% Get race info
% Grabbing the different races identified in dataset
race_idx = find(contains(colnames, 'Deaths'));
race_idx = race_idx(2 : 10); % remove total and ethnicity info
races = raceData(end, race_idx);
raceID =  strrep(races.Properties.VariableNames, 'Deaths_','');

% selecting relevant columns
plotData = raceData(1 : end, {'Date', 'State', 'Cases_White', ...
    'Cases_Black', 'Cases_LatinX', 'Cases_Asian', 'Deaths_White', ...
    'Deaths_Black', 'Deaths_LatinX', 'Deaths_Asian'});

% reshape case variable for plotting
case_col = startsWith(plotData.Properties.VariableNames, 'Cases');
cases = table2array(plotData(:, case_col));
cases = reshape(cases, [], 1);

% reshape death variable for plotting
death_col = startsWith(plotData.Properties.VariableNames, 'Deaths');
deaths = table2array(plotData(:, death_col));
deaths = reshape(deaths, [], 1);

MATLAB’s scatterhist requires that we pass in a group label that is the same dimension as the x and y datapoints being plotted, so we create a new array, race_label, containing matched labels.

% Generate race labels matching the column vector for cases/deaths
race_label = [];
for i = 1 : 4
    race_label = [race_label; repmat(raceID(i), size(raceData,1), 1)];
end
Creating the marginal plots
scatterhist

Now that all the variables are ready, we use the scatterhist function to create a scatter plot with histogram plotted along the margins. Here, we specify group label colors to match colors used in R plot. We also specify the location where we want the histograms to be plotted with respect to the scatter plot using argument pair (‘Location’, ‘NorthEast’) and the direction that we want the histogram to point by argument pair (‘Direction’, ‘out’).

x = cases;
y = deaths;

% Specify color to be used for each race grouping
red = [236, 95, 97] / 255;
blue = [115, 165, 205] / 255;
green = [131, 199, 129] / 255;
purple = [173, 113, 182] / 255;
c_array = [red; blue; green; purple];

figure
% scatterhist does not allow us to adjust the transparency of the plots, 
% to do this we can generate marginal histogram from scratch using subplot
h = scatterhist(x, y, 'Group', race_label, 'Style', 'bar', ...
    'marker', '.', 'Location', 'NorthEast', 'Direction','out', ...
    'color', c_array, 'MarkerSize', 16);

[leg,~] = legend('show');
title(leg,'Race')
xlabel('Cases')
ylabel('Deaths')
sgtitle('Total confirmed COVID-19 deaths vs. cases, U.S. States (11/03/20)') 

While the built-in function scatterhist is useful for quick plotting because many properties of the figure and the layout do not have to be specified, this function faces the downside of not being able to customize various settings which the user may be looking to specify. Alternatively to scatterhist, we may use subplot functionality to get around this issue by manually creating a marginal histogram by combining the 3 components (scatterplot + 2 histograms). Please refer to the section ‘subplot’ for the walkthrough.

subplot

In MATLAB, we can also manually create a marginal histogram by arranging together 2 histograms and a scatterplot into a single figure. Because we’re working with multiple, separate functions like hist and scatter with distinct plot handles for each group, we have much larger customizability of the arranged plots, whereas scatterhist prevents users from grabbing all of the figure’s property handles and modifying them, like transparency and bin width specific to each histogram.

We’ll be using the toolbox ‘subaxis’ which builds off the built-in subplot functionality in order to minimize the margins between each subplots. This toolbox can be directly downloaded from link but it is also included in the repository containing this tutorial.

addpath('subaxis')

Working off the preprocessed data from earlier steps above, we now have to manually plot a histogram for each race and overlay them on top of one another. We do this two times for each of the two histograms (deaths and cases) that’s needed for a marginal histogram.

subaxis functions allows us to arrange multiple plots into a single figure and additionally allows us to specify the margins surrounding each subplot.

%% Using subplot to create marginal histogram
% breaking race data into separate arrays
white = find(race_label == "White");
black = find(race_label == "Black");
latinx = find(race_label == "LatinX");
asian = find(race_label == "Asian");
  • Creating top/right margin plots
figure(1)
clf
% y-data histogram
ah1 = subaxis(2, 2, 4, 'sh', 0, 'sv', 0.01, 'padding', 0, 'margin', 0.1);
p1 = histogram(deaths(white), 'Orientation', 'horizontal', ...
    'Normalization', 'probability', 'BinWidth', 1000, ...
    'Facecolor', red);
hold on
p2 = histogram(deaths(black), 'Orientation', 'horizontal', ...
    'Normalization', 'probability', 'BinWidth', 1000, ...
    'Facecolor', blue);
hold on
p3 = histogram(deaths(latinx), 'Orientation', 'horizontal', ...
    'Normalization', 'probability', 'BinWidth', 1000, ...
    'Facecolor', green);
hold on
p4 = histogram(deaths(asian), 'Orientation', 'horizontal', ...
    'Normalization', 'probability', 'BinWidth', 1000, ...
    'Facecolor', purple);

% x-data histogram
ah2 = subaxis(2, 2, 1,'sh', 0.03, 'sv', 0.03, 'padding', 0, 'margin', 0.1);
histogram(cases(white), 'Normalization', 'probability', ...
    'BinWidth', 20000, 'Facecolor', red)
hold on
histogram(cases(black), 'Normalization', 'probability', ...
    'BinWidth', 20000, 'Facecolor', blue)
hold on
histogram(cases(latinx), 'Normalization', 'probability', ...
    'BinWidth', 20000, 'Facecolor', green)
hold on
histogram(cases(asian), 'Normalization', 'probability', ...
    'BinWidth', 20000, 'Facecolor', purple)

The transparency of the scatter points can now be specified because this feature is supported through built-in function scatter, using the argument Markerfacealpha. I have specified 60% (denoted as 0.6) opacity for all the points.

% scatterplot
ah3 = subaxis(2, 2, 3, 'sh', 0.03, 'sv', 0.01, ...
    'padding', 0, 'margin', 0.1);
hold on
scatter(cases(white), deaths(white), 'filled', ...
     'MarkerFaceColor', red, 'Markerfacealpha', 0.6)
scatter(cases(black), deaths(black), 'filled', ...
    'MarkerFaceColor', blue, 'Markerfacealpha', 0.6)
scatter(cases(latinx), deaths(latinx), 'filled', ...
    'MarkerFaceColor', green, 'Markerfacealpha', 0.6)
scatter(cases(asian), deaths(asian), 'filled', ...
    'MarkerFaceColor', purple, 'Markerfacealpha', 0.6)

We need to remove all the boxes and axes information from the histograms since we’re combining them with the scatterplot. Similar to function scatterhist‘s argument ’Direction’ specifying the direction which the histogram plots, View property of the subplots can be adjusted to rotate the histograms in the desired direction (commented out in the code below ah1.View = [180, -90]). Similarly, the arrangement of the subplot is reponsible for specifying the location of the histograms, whether they’re to the top or bottom, and left or right of the scatter plot.

% Remove boxes and axes information from the histogram
linkaxes([ah1, ah3], 'y')
linkaxes([ah3, ah2], 'x')
ah1.Box = 'off';
% ah1.View = [180, -90];
ah1.Visible = 'off';
ah2.Visible = 'off';
ah2.Box = 'off';
% ah2.View = [0, -90];

We finish off by creating a single universal legend for all three plots. Since this will not happen automatically when manually plotting and arranging a marginal histogram.

% Create single universal legend
h = [p1;p2;p3;p4];
hold on
lgd = legend(h, 'White', 'Black', 'Latinx', 'Asian');
lgd.Box = 'off';
newPosition = [0.5 0.6 0.13 0.13];
newUnits = 'normalized';
set(lgd, 'Position', newPosition, 'Units', newUnits);
title(lgd, 'Races')

% Give title and label axis in scatterplot
sgtitle('Total confirmed COVID-19 deaths vs. cases, U.S. States (11/03/20)')
xlabel('Cases')
ylabel('Deaths')

Python

Setting up packages
import numpy as np
import pandas as pd
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
Reading in and pre-processing the data

Before plotting, we organize the csv file Race Data Entry - CRDT.csv which contains the data we need for plotting the marginal histogram. We set the data to be Nov 1, 2020 which stays the same with other parallel examples.

df = pd.read_csv('./Data/Race Data Entry - CRDT.csv')
new = df[df['Date'] == 20201101]

To avoid too much information showed in a single plot, we are focusing on four different racial groups: White, Black, LatinX, and Asian.

new = new[['Cases_Asian', 'Cases_Black', 'Cases_LatinX','Cases_White',
           'Deaths_Asian', 'Deaths_Black', 'Deaths_LatinX', 'Deaths_White']]
new = new.dropna(axis=0, how='any') # drop rows that have any NaN

x1 = new['Cases_Asian']
y1 = new['Deaths_Asian']
x2 = new['Cases_Black']
y2 = new['Deaths_Black']
x3 = new['Cases_LatinX']
y3 = new['Deaths_LatinX']
x4 = [float(x) for x in new['Cases_White']]
y4 = new['Deaths_White']
Plot settings

Here we specify some important variables which formulate our further plotting. These variables (width, length, spacing, bottom, etc) determine the size of main-plot and marginal-plots.

left, width = 0.1, 0.65
bottom, height = 0.1, 0.65
spacing = 0.005

rect_scatter = [left, bottom, width, height]
rect_histx = [left, bottom + height + spacing, width, 0.2]
rect_histy = [left + width + spacing, bottom, 0.2, height]

xmax = max(np.max(x1), np.max(x2), np.max(x3))
ymax = max(np.max(y1), np.max(y2), np.max(y3))
binsx = np.arange(0, xmax + 1000, 9000)
binsy = np.arange(0, ymax + 200, 200)
  • Creating the marginal plot using matplotlib We can divide the plotting into 3 steps:
    • Plotting the main frame. Adding corresponding labels & legends.
    • Adding the boxes for marginal plots
    • For each race, making the scatter plot in the main square and histograms showing marginal distributions in the marginal boxes.
fig = plt.figure(figsize=(11, 8))
ax = fig.add_axes(rect_scatter)
plt.xlabel('Cases');
plt.ylabel('Deaths');
plt.title('Total confirmed COVID-19 deaths vs. cases, U.S. States (11/01/20)',
          x = 0.6, y = -0.16, fontsize = 14);
plt.grid(color='LightGrey')
purple_patch = mpatches.Patch(color='purple', label='Asian')
blue_patch = mpatches.Patch(color='blue', label='Black')
green_patch = mpatches.Patch(color='green', label='LatinX')
red_patch = mpatches.Patch(color='red', label='White')
ax_histx = fig.add_axes(rect_histx, sharex=ax)
ax_histy = fig.add_axes(rect_histy, sharey=ax)
plt.legend(handles=[red_patch, blue_patch, green_patch, purple_patch],
           title='Race', loc = 'best');

ax.scatter(x1, y1, color = 'purple', alpha = 0.5);
ax_histx.hist(x1, bins=binsx, color = 'purple', alpha = 0.5);
ax_histy.hist(y1, bins=binsy, orientation='horizontal', 
              color = 'purple', alpha = 0.5);
ax.scatter(x2, y2, color = 'blue', alpha = 0.5);
ax_histx.hist(x2, bins=binsx, color = 'blue', alpha = 0.5);
ax_histy.hist(y2, bins=binsy, orientation='horizontal', 
              color = 'blue', alpha = 0.5);
ax.scatter(x3, y3, color = 'green', alpha = 0.5);
ax_histx.hist(x3, bins=binsx, color = 'green', alpha = 0.5);
ax_histy.hist(y3, bins=binsy, orientation='horizontal', 
              color = 'green', alpha = 0.5);
ax.scatter(x4, y4, color = 'red', alpha = 0.5);
ax_histx.hist(x4, bins=binsx, color = 'red', alpha = 0.5);
ax_histy.hist(y4, bins=binsy, orientation='horizontal', 
              color = 'red', alpha = 0.5);

plt.show()
# plt.savefig("python_marginal.png", dpi=200, bbox_inches='tight')


Bubble plot

In our project, we use bubble plot to see the relationship between stringency index, human development index, and total # of deaths in each country. Each bubble represents a single country with x-axis measuring the stringency index, y-axis measuring the human development index, and the size of the bubble measuring the total number of deaths in the country at the time of 10/20/20. Bubble plot is useful because it allows us to easily visualize 3 dimensions of the data rather than the conventional 2.

  • Variables
    • ID: location(iso_code) & date
    • continent
    • total_deaths_per_million
    • stringency_index (0 - 100, with 100 most strict)
    • human_development_index (0 - 1, with 1 most developed)
  • This figure presents only the Oct 20, 2020 data.

R

# setup
library(tidyverse)
Reading in and pre-processing the data
filename_covid = "./Data/owid-covid-data.csv"
plot2_data = read_delim(filename_covid, delim = ",") %>%
  select(date, iso_code, continent, total_deaths_per_million,
         stringency_index, human_development_index) %>%
  filter(date == "2020-10-20")
Format of plot2_data
date iso_code continent total_deaths_per_million stringency_index human_development_index
2020-10-20 ABW North America 318.453 50.46 NA
2020-10-20 AFG Asia 38.455 16.67 0.498
2020-10-20 AGO Africa 7.515 71.30 0.581
2020-10-20 AIA North America NA NA NA
2020-10-20 ALB Europe 157.759 42.59 0.785
Number of NA values in plot2_data
date iso_code continent total_deaths_per_million stringency_index human_development_index
0/214 1/214 2/214 23/214 42/214 33/214
Creating the bubble plot
plot2_title = paste0(
  "Total # of Deaths Across Gov. Stringency Index", 
  " and Human Development Index"
)

plot2_data %>%
  filter(!is.na(continent)) %>% # remove 'NA' from the plot's legend
  ggplot(aes(x = stringency_index, y = human_development_index,
             size = total_deaths_per_million, color = continent)) +
  theme_bw() +
  geom_point(alpha = .5) +
  scale_size_continuous(
    name = "Total Death (Per Mil.)",
    breaks = seq(100, 1300, 300),
    range = c(2, 14)
  ) +
  scale_color_brewer(
    name = "Continent", 
    palette = "Set1"
  ) +
  scale_x_continuous(
    name = "Stringency Index", 
    breaks = seq(0, 100, 10)
  ) +
  scale_y_continuous(
    name = "Human Development Index",
    breaks = seq(0, 1, 0.1)
  ) +
  ggtitle(plot2_title) +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 12),
    axis.title = element_text(size = 10)
  ) +
  guides(color = guide_legend(override.aes = list(size = 3))) # enlarge the symbol size in the color legend

STATA

Reading in and pre-processing the data
cd "E:\git\Stats506_midterm_project\STATA\"
import delimited "..\Data\owid-covid-data.csv", clear

// keep only relevant variables & data points
keep if date == "2020-10-20"
keep iso_code continent total_deaths_per_million ///
     stringency_index human_development_index
drop if continent == "" //drop international world data points

// drop observations if any of the index measures are missing
drop if stringency_index ==. | ///
        human_development_index==. | ///
            total_deaths_per_million ==.

save covid_plot_data, replace
Creating the bubble plot

Note: The optacity settings (e.g. mcolor(red%30)) are only available in STATA15 or above. For STATA14 or below, try using hollow circles instead of solid ones by adding the option msymbol(Oh).

use covid_plot_data, clear

// categorize the human_development_index into six groups by continents
separate human_development_index, by(continent) veryshortlabel

local vars human_development_index? stringency_index ///
      [aw = total_deaths_per_million]
local symbol O // solid circles
local size .75 // make all points smaller (75% of the original size)
local alpha %30
local options xscale(range(0 80)) yscale(range(0.32 1)) ///
            mcolor(red`alpha' blue`alpha' green`alpha' ///
                   purple`alpha' orange`alpha' yellow`alpha') ///
            msymbol(`symbol' `symbol' `symbol' `symbol' `symbol' `symbol') ///
            msize(`size' `size' `size' `size' `size' `size')
local graph_title = ("Total # of Deaths Across Gov. Stringency Index " + ///
                     "and Human Development Index")
scatter `vars', `options' ///
  legend(subtitle("Continent") position(3) cols(1) ///
           region(style(none))) ///
  title("`graph_title'", size(*.7)) ///
  ytitle("Human Development Index") xtitle("Government Stringency Index") ///
  yscale(titlegap(3)) // enlarge the gap between y-axis title and ticks
                      // to avoid overlap

// save the plot
graph export stata-bubbleplot.png, replace

MATLAB

Bubble plot is a relatively new function (bubblechart) introduced to MATLAB in R2020b. We’ll need a minimum version of R2020b to be able to run this tutorial.

Reading in and pre-processing the data
%% BubblePlots
% For ref see: https://www.mathworks.com/help/matlab/ref/bubblechart.html

%% Data Prep
fileName = './Data/owid-covid-data.csv';
covid_orig = readtable(fileName);

% Select 2020-10-20 Data
newestDate = datetime([2020  10  20]);
covid = covid_orig(covid_orig.date == newestDate,:);

Checking the # of NaN values for the stringency index:

sum(isnan(covid.stringency_index))
ans = 40

Checking the # of NaN values for the human development index:

sum(isnan(covid.human_development_index))
ans = 33
% Variables of interest (VOI)
voi = {'location', 'continent', 'total_cases_per_million', ...
       'total_deaths_per_million','stringency_index',...
       'human_development_index'};

% Remove international and world info and keep only variables of interest
covid = covid(1:end-2, voi);

We want to group the countries by its continent membership to see if there’re any patterns to certain continents being more stringent in COVID-19 restrictions or if certain continents are being impacted more heavily with COVID-19 deaths.

% Convert continent label to color label to be used in bubble plot
for k = 1 : size(covid, 1)
    if strcmp(covid.continent(k), 'Africa')
       covid.continent{k} = 1;
    elseif strcmp(covid.continent(k), 'Asia')
       covid.continent{k} = 2;
    elseif strcmp(covid.continent(k), 'Europe')
        covid.continent{k} = 3;
    elseif strcmp(covid.continent(k), 'North America')
        covid.continent{k} = 4;
    elseif strcmp(covid.continent(k), 'Oceania')
        covid.continent{k} = 5;
    elseif strcmp(covid.continent(k), 'South America')
        covid.continent{k} = 6;
    end
end

covid = sortrows(covid, 2);

x = table2array(covid(:,5)); % x-axis: Stringency index
y = table2array(covid(:,6)); % y-axis: Human development index
sz = table2array(covid(:,4)); % size: Total deaths per million
c = cell2mat(table2array(covid(:,2))); % color: continent

In MATLAB, to create a bubble chart with different grouping colors, different groups have to be plotted separately and then overlaid on top of one another. We can pass in an array of RGB values into bubblechart as the third argument to specify the color we want the bubbles to be for each group.

Creating the bubble plot
% Stringency index of countries grouped by continent
x1 = x(c==1); x2 = x(c==2); x3 = x(c==3);
x4 = x(c==4); x5 = x(c==5); x6 = x(c==6);
% Human development index of countries grouped by continent
y1 = y(c==1); y2 = y(c==2); y3 = y(c==3); 
y4 = y(c==4); y5 = y(c==5); y6 = y(c==6);
% Total Deaths per mil. of countries grouped by continent
sz1 = sz(c==1); sz2 = sz(c==2); sz3 = sz(c==3);
sz4 = sz(c==4); sz5 = sz(c==5); sz6 = sz(c==6);

%% Plot data in a tiled chart layout
figure

% colors for plotting
red = [234, 82, 84]/255;
blue = [105, 158, 201]/255;
green = [99, 185, 96]/255;
purple = [177, 122, 186]/255;
orange = [255, 186, 117]/255;
yellow = [255, 255, 153]/255;

t = tiledlayout(1,1);
nexttile
bubblechart(x1, y1, sz1, red) % Africa
hold on
bubblechart(x2, y2, sz2, blue) % Asia
hold on
bubblechart(x3, y3, sz3, green) % Europe
hold on
bubblechart(x4, y4, sz4, purple) % N. America
hold on
bubblechart(x5, y5, sz5, orange) % Oceania
hold on
bubblechart(x6, y6, sz6, yellow) % S. America
hold off

sgtitle('Total # of Deaths Across Gov. Stringency Index and Human Development Index')
xlabel('Stringency Index')
ylabel('Human Development Index')

We want to include both the legend for the continents and the legend for the size of the bubbles (bubblelegend). In order to include these 2 separate legends, we have to specify a layout for the tile.

blgd = bubblelegend('Total Death (Per Mil.)','Location','eastoutside');
lgd = legend('Africa','Asia', 'Europe', 'North America', ...
    'Oceania', 'South America');
title(lgd, 'Continent')
blgd.Layout.Tile = 'east';
lgd.Layout.Tile = 'east';

While the default value of the transparency of the bubbles is set at 60% opacity, we’ll bring it down to 40% so it’s easier to see all the points since we have a large number of data points.

b1.MarkerFaceAlpha = 0.4; b2.MarkerFaceAlpha = 0.4;
b3.MarkerFaceAlpha = 0.4; b4.MarkerFaceAlpha = 0.4;
b5.MarkerFaceAlpha = 0.4; b6.MarkerFaceAlpha = 0.4;

Python

In the Python part, we make implement Bubble Plot using two different packages: seaborn and matplotlib. These two packages make plots by different ways which worth us to have a comparison between them.

Set up packages
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
Reading in and pre-processing the data
df2 = pd.read_csv("./Data/owid-covid-data.csv")
new2 = df2[df2['date'] == '2020-10-20']
new2 = new2[['continent', 'total_deaths_per_million', 
             'stringency_index', 'human_development_index']]
new2 = new2.dropna(axis=0, how='any')
Creating the bubble plot
Seaborn

In Seaborn, we are going to use the seaborn.scatterplot function to implement the bubble plot. The size parameter takes the value of total_deaths_per_million which indicates the radius of each bubble. Here we set the transparency indicator alpha to be 40% to help us visualizing bubbles which are overlapped.

# draw plot using seaborn
plt.figure(figsize=(11, 7));
sns.scatterplot(x='stringency_index',
                y='human_development_index',
                size=new2['total_deaths_per_million'],
                sizes = (10, 1250),
                alpha=0.6,
                hue=new2['continent'],
                data=new2);

# Put the legend out of the figure
<matplotlib.axes._subplots.AxesSubplot object at 0x0000013054F6E8D0>
plt.legend(bbox_to_anchor=(1.03, 1),borderaxespad=0,
           prop={'size':12}, handlelength=2,
          frameon=False);
plt.rcParams.update({'legend.labelspacing':1.7});
plt.xlabel('Stringency Index');
plt.ylabel('Human Development Index');
plt.title('Total # of Death Across Stringency Index & Human Development Index',
         fontsize = 13);
plt.grid(color='LightGrey');
plt.tight_layout();
plt.show()
# plt.savefig('python_bubble_seaborn.png', dpi=200, bbox_inches='tight')

Matplotlib

Here we use the matplotlib.scatter function to implement the bubble plot. Different from Seaborn, we use a loop here to combine continents with colors.

continent_list = ['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']
color_list = ['red', 'blue', 'green', 'purple', 'gold', 'pink']
colors = {'Africa':'red', 'Asia':'blue', 'Europe':'green',
          'North America':'purple', 'Oceania': 'gold',
          'South America': 'pink'}
l = []
for i in range(6):
    l.append(mpatches.Patch( color=color_list[i],
                             alpha=0.7,
                             label=continent_list[i]))
    
plt.figure(figsize = (11,8));
plt.scatter('stringency_index',
            'human_development_index',
            s='total_deaths_per_million',
            c=new2['continent'].apply(lambda x: colors[x]),
            alpha=0.7,
            data=new2);
plt.grid(color='LightGrey');
plt.xlabel('Stringency Index');
plt.ylabel('Human Development Index');
plt.title('Total # of Death Across Stringency Index & Human Development Index');
legend1 = plt.legend(handles=l, loc = (1.02,0.78));
plt.show()
# plt.savefig('python_bubble_matplotlib.png', dpi=200, bbox_inches='tight')


Interactive plot

Interactive plots allow us to manipulate the viewing of the data points plotted and can be useful in focusing on specific aspect of the data. In this example, we will plot an interactive histogram comparing the # of cases, deaths, tests performed, and hospitalized across a few different countries in Europe (dataset). Clicking the legend in the plot will either hide or show the bar associated with the specific item in the legend.

  • Variables
    • ID: location(iso_code) & date
    • total_cases_per_million
    • total_deaths_per_million
    • hosp_patients_per_million
    • total_tests_per_thousand
  • This figure presents only the Oct 20, 2020 data.

R

# setup
library(tidyverse)
Reading in and pre-processing the data
filename_covid = "./Data/owid-covid-data.csv"

plot3_data = read_delim(
  filename_covid, delim = ",",
  col_types = cols( # specify the types for our variables of interest
    # otherwise some will be read in as 'logical' and lead to incorrect values
    date = col_character(),
    iso_code = col_character(),
    location = col_character(),
    total_cases_per_million = col_double(),
    total_deaths_per_million = col_double(),
    hosp_patients_per_million = col_double(),
    total_tests_per_thousand = col_double()
  )
) %>%
  select(date, iso_code, location, 
         total_cases_per_million, 
         total_deaths_per_million,  
         hosp_patients_per_million, 
         total_tests_per_thousand) %>%
  filter(
    iso_code %in% c("AUT", "BEL", "BGR", "CZE", "DNK"),
    date == "2020-10-20"
  ) %>%
  mutate(
    total_tests_per_thousand = total_tests_per_thousand * 1e03
    # turn 'per thousand' to 'per million'
  ) %>%
  pivot_longer(
    cols = total_cases_per_million:total_tests_per_thousand,
    names_pattern = "(.*)_per", 
    names_to = "variable"
  ) %>%
  mutate(
    variable = factor(
      variable, 
      levels = c("total_cases", "total_deaths", 
                 "hosp_patients", "total_tests"),
      # add labels so that the legend items in the plot make more sense
      label = c("Cases", "Deaths", "Hospitalizations", "Tests"),
      # order the factor levels to control for the order of variables
      # displayed in the bar plot
      ordered = TRUE)
  )
Format of plot3_data
date iso_code location variable value
2020-10-20 AUT Austria Cases 7395.963
2020-10-20 AUT Austria Deaths 102.039
2020-10-20 AUT Austria Hospitalizations 82.608
2020-10-20 AUT Austria Tests 218961.000
2020-10-20 BEL Belgium Cases 22606.875
Number of NA values in plot3_data
date iso_code location variable value
0/20 0/20 0/20 0/20 0/20
Creating the interactive plot

Using library ggplotly, we can easily add interactivity to a ggplot object, and the interactivity will be retained in an HTML output of Rmarkdown.

Try the following:

  • Double click an item on the legend to single out that character, and double click again to show all characters.
  • Single click an item to hide / re-display that character.
plot3_title = "Comparison of # of Cases, Deaths, Hospitalizations, and Testing"
plot3_base = plot3_data %>%
  ggplot(aes(x = location, y = value,
             fill = variable)) +
  theme_bw() +
  geom_col(position = "dodge", width = .7) +
  scale_fill_brewer(
    name = "Total Number", 
    palette = "Paired"
  ) +
  scale_x_discrete(name = "Countries") +
  scale_y_continuous(name = "# per Mil.") +
  ggtitle(plot3_title) +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 12),
    axis.title = element_text(size = 10)
  )

plot3_base %>%
  plotly::ggplotly()

Due to the large data range, we use a log10 scale to transform the y-axis value to ensure all variables of interest are visible.
Note that when the mouse hovers over a bar, the value presented will still be the original value before scaling. That way we can know the exact value without referencing y-axis labels.

# transform the y-axis values using a log10 scale
{plot3_base +
    scale_y_continuous(
      name = "log10(# per Mil.)", 
      trans = "log10"
    ) + 
    theme(
      axis.text.y = element_blank() # remove the y-axis label
      # because they are not helpful due to the large data range
    )
  } %>%
  plotly::ggplotly()

Note that the legend position is higher than static plots. We can move the legend items downwards using plotly::layout(), and move the legend title downwards by (i) hiding the original legend title, and (ii) add a new legend title using plotly::add_annotations.

{plot3_base + 
    scale_y_continuous(name = "log10(# per Mil.)", trans = "log10") +
    theme(
      axis.text.y = element_blank(),
      legend.title = element_blank() # hide the legend title
    )
  } %>%
  plotly::ggplotly() %>%
  # move legend downwards
  plotly::layout(legend=list(y=0.8, yanchor="top")) %>% 
  # add the legend title back but to a lower position
  plotly::add_annotations(
    text="Total Number",
    xref="paper", x=1.02, xanchor="left",
    yref="paper", y=0.8, yanchor="bottom",
    legendtitle=TRUE,
    showarrow=FALSE
  )

MATLAB

In this tutorial, we’ll go over 2 different ways we can introduce interactive feature with MATLAB. 1.) Clickable Legend Toolbox is a MATLAB FileExchange toolbox that allows users to hide or unhide data points by clicking on the grouping variable in the legend. 2.) plotly-graphing-library-for-matlab is a toolbox written by Plotly distributed through github, allowing MATLAB users to create and export interactive charts in the web brower by incorporating some of the same functionality of Plotly offered in R and Python. However, it should be noted that the functionality built into MATLAB plotly is much more restricted than those available in R and Python. After the preprocessing step, the tutorial will be divided to show these 2 different methods.

Reading in and pre-processing the data
%% Data Prep
addpath('clickable_legend')
fileName = 'owid-covid-data.csv';
covid_orig = readtable(fileName);

% Select 2020-10-20 Data
newestDate = covid_orig.date(217);
covid_hosp = covid_orig(covid_orig.date == newestDate,:);
% Keep only countries that have hospitalization data
covid_hosp = covid_hosp(~isnan(covid_hosp.hosp_patients_per_million),:);

% Plotting the following variables:
%   # of deaths, # of cases, # of tests, # hospitalizations
voi = {'location', 'total_cases_per_million', ...
    'total_deaths_per_million','hosp_patients_per_million',...
    'total_tests_per_thousand'};

% Update dataset to only include variables of interest
covid_hosp = covid_hosp(:, voi);
% Convert total_tests_per_thousand to total_tests_per_million
covid_hosp(:,'total_tests_per_thousand') = ...
table(1000*covid_hosp.total_tests_per_thousand);
covid_hosp.Properties.VariableNames{5} = 'total_tests_per_million';
Creating the interactive plot
Clickable Legend

Clickable Legend Toolbox was downloaded from MATLAB FileExchange in order to make the legend interactive. It should be noted that the original toolbox had a small bug associated with a coloring issue in the function (togglevisibility) that had to be corrected. The updated version of the toolbox is available in the project repository on github.

Clickable Legend functionality can be added to any plots that has a multiple groups that can create a legend. In our case, we use bar graphs to compare the number of cases, deaths, hospitalization, and testing across 5 different European countries. However, it can be added to line plots and scatter plots (by getting rid of the line and specifying marker shape and size).

%% Plotting 
nCountries = 5;
X = categorical(covid_hosp.location(1:nCountries));
Y = [covid_hosp{1:nCountries,2}, covid_hosp{1:nCountries,3},...
    covid_hosp{1:nCountries,4}, covid_hosp{1:nCountries,5}];
figure
bar(X,Y)

Once the base plot has been plotted, we add the interactive functionality by calling function ‘clickableLegend’ and enable the toggling of the visibility of the bars by clicking on the text shown in the legend. In calling the function (clickableLegend), we specify the name of each legend entry and the optional location of the legend, through passing argument pair, (‘Location’, ‘eastoutside’).

Note: The y-axis was log scaled for better comparison of values across different statistics.

clickableLegend({'Case','Death','Hosp','Test'}, 'Location', 'eastoutside');
title('Comparison of Cases, Deaths, Hospitaliations, and Testing')
xlabel('Countries')
ylabel('log10(# per million)')
set(gca, 'YScale', 'log')

Unfortunately, MATLAB figure created this way does not retain its interactive functionality through an html export, to see it in action, the example code will have to be executed in MATLAB. The .m script of this code is available in the project github repository. Alternatively, follow the instructions using plotly for MATLAB to easily export interactive html figure.

plotly (MATLAB)

To use plotly for MATLAB, we download the github repository as listed above and add it to our MATLAB path. If this is your first time trying to use plotly on MATLAB, you’ll have to run a single line command, plotlysetup('your_username', 'your_api_key') with the username and API key that you obtain from plotly website. Further instructions are included in the plotly for matlab github repository.

addpath('./plotly-graphing-library-for-matlab-master');

To create an interactive grouped bar chart, each of the variables being plotted, in our case, # of cases, deaths, hospitalizations, and tests, have to be organized into structs with options specifying the overall grouping variable (by country), variable name, type of plot (bar). Additional paramters, like the color of the groups can be added to customize the figure.

countries = covid_hosp.location(1 : nCountries);
logY = log10(Y);

trace1 = struct(...
  'x', { countries }, ...
  'y', logY(:,1), ...
  'name', 'Total # Cases', ...
  'marker', struct('color', 'rgb(166, 206, 227)'),...
  'type', 'bar');
trace2 = struct(...
  'x', { countries }, ...
  'y', logY(:,2), ...
  'name', 'Total # Deaths', ...
  'marker', struct('color', 'rgb(29, 124, 180)'),...
  'type', 'bar');
trace3 = struct(...
  'x', { countries }, ...
  'y', logY(:,3), ...
  'name', 'Total # Hospitalizations', ...
  'marker', struct('color', 'rgb(178, 223, 138)'),...
  'type', 'bar');
trace4 = struct(...
  'x', { countries }, ...
  'y', logY(:,4), ...
  'name', 'Total # Tested', ...
  'marker', struct('color', 'rgb(52, 164, 44)'),...
  'type', 'bar');

We finally combine all of the structs for each variable then pass this combined struct array into yet another struct which puts it all together with other parameters like the title, yaxis, xaxis, legend. With the plotly function, we generate the figure (within MATLAB window) and are also provided a url for this figure which is automatically uploaded to your plotly account.

data = {trace1, trace2, trace3, trace4};
layout = struct('title', ...
    'Comparison of # of Cases, Deaths, Hospitalizations, and Testing',...
    'yaxis', struct(...
      'title', 'log10(# per million)'),...
    'xaxis', struct(...
      'title', 'Countries'),...  
    'barmode', 'group', ...
    'legend', struct(...
      'x', 1.0, ...
      'y', 1.0, ...
      'bgcolor', 'rgba(255, 255, 255, 0)', ...
      'bordercolor', 'rgba(255, 255, 255, 0)'));
response = plotly(data, struct('layout', layout, ...
    'filename', 'grouped-bar', 'fileopt', 'overwrite'));
plot_url = response.url;

We can then go to the plot_url to export the figure as an html figure on top of the figure generated within MATLAB that is already interactive.

Python

Set up packages

We use the python package plotly to implement our interactive plot. The package contains a lot of helpful functions which enable the user to interact with elements in the plot.

import numpy as np
import plotly.express as px
import plotly.offline as offline
import plotly.graph_objs as go
Reading in and pre-processing the data

Similar to the Bubble Plot example, we still use the data of Oct 20, 2020.

# will be using the same dataset as in the bubble plot section
new3 = df2[df2['date'] == '2020-10-20']
new3 = new3[['location', 'total_cases_per_million', 'total_deaths_per_million',
             'hosp_patients_per_million',  'total_tests_per_thousand']]
new3 = new3.dropna(axis=0, how='any')[0:5]
Creating the interactive plot

Here we first use the go.Bar() function to create 4 traces which represent the 4 categories in our plot: total_cases_per_million, total_deaths_per_million, hosp_patients_per_million, and total_tests_per_thousand. Inside the go.Bar() function, we specify the variable and its color. Later we are going to combine these four traces and make the plot using go.Figure() function. Meanwhile, the corresponding legend for this plot will be generated automatically.

branches = new3['location']
total_cases_per_million = np.log10(new3['total_cases_per_million'])
total_deaths_per_million = np.log10(new3['total_deaths_per_million'])
hosp_patients_per_million = np.log10(new3['hosp_patients_per_million'])
total_tests_per_thousand = np.log10(new3['total_tests_per_thousand']*1000)

trace1 = go.Bar(
    x = branches,
    y = total_cases_per_million,
    name = 'Cases',
    marker = {'color': 'lightblue'}
)
trace2 = go.Bar(
    x = branches,
    y = total_deaths_per_million,
    name = 'Deaths',
    marker = {'color': 'steelblue'}
)
trace3 = go.Bar(
    x = branches,
    y = hosp_patients_per_million,
    name = 'Hospitalizations',
    marker = {'color': 'darkseagreen'}
)
trace4 = go.Bar(
    x = branches,
    y = total_tests_per_thousand,
    name = 'Tests',
    marker = {'color': 'forestgreen'}
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    title = 'Comparison of Cases, Deaths, Hospitalizations, and Testing',
    barmode = 'group')
fig = go.Figure(data = data, layout = layout)
fig.update_xaxes(title_text='Countries');
fig.update_yaxes(title_text='log10(# per Mil.)');
fig.update_layout;
offline.plot(fig, filename='interactive.html');


Discussion

Data visualization can prove to be an immensely useful tool when trying to understand and organize large amounts of data. In this tutorial, we reviewed a few graphical concepts, marginal histogram, bubble plot, and interactive plots to demonstrate the functionality of these different graphical concepts and how they may facilitate a better grasp of the pattern captured in the data.

Comparison across software/tools

The tutorial is carried out in 4 different programming languages (R, STATA, Python, MATLAB). When it comes to data visualization, the two open-source software, R and Python take the clear lead in flexibility and functionality over the two proprietary languages, STATA and MATLAB. The open-source nature of R and Python has allowed the development of a vast number of user written libraries capable of producing publication quality plots with large customizability of the plots.

However, at the cost of less customizability on part of the user, plotting in MATLAB and STATA tends to be a little more succinct and straightforward, with many default parameters in place that do not need to be specified. We also find that STATA’s macros come in really handy when we want to unify settings across multiple figures without writing duplicated code.

However, we would note that once we begin to try to implement the same level of specifications in the plots as these open-source software, MATLAB and STATA lose this advantage. For example, with R’s ggplot2, we can easily change aesthetic options. If we want to enlarge the symbol size in a color legend while keeping the symbols in the main plot untouched, simply using guides(color = guide_legend(override.aes = list(size = 3))) will do. The same task can take much more effort in STATA: we have to create another layer of scatterplot that does not have any valid data points (so the plotting area will be empty) but has larger symbol size (for the legend). See the code example below:

sysuse auto
twoway /// the 3rd and 4th layers are empty scatter plots with larger msize
    (scatter mpg weight if foreign==0, msymbol(o) mcolor(orange)) ///
    (scatter mpg weight if foreign==1, msymbol(o) mcolor(green))  ///
    (scatter mpg weight if mpg==., msymbol(o) mcolor(orange) msize(*3)) ///
    (scatter mpg weight if mpg==., msymbol(o) mcolor(green)  msize(*3)), ///
    legend(order(3 4) label(3 domestic) label(4 foreign))

Python contains a lot of libraries, which allows different implementations of the same plot type. For instance, the bubble plot can be implemented using either matplotlib or Seaborn. Interestingly, the matplotlib package inherits lots of nice properties from MATLAB. Hence it is possible to make plots similar to MATLAB using this package.

It is a little bit difficult to implement complicated legend in Python using matplotlib (e.g. the size legend in the bubble plot). The same graph in Python can take more lines of code than using MATLAB. It is also worth noting that Python programs become much slower when the data size gets larger.